
49 research outputs found

Entropy-scaling search of massive biological data

Author: Berger Bonnie
Daniels Noah M.
Danko David Christian
Yu Y. William
Publication venue: 'Elsevier BV'
Publication date: 01/06/2015

Many datasets exhibit a well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here, we introduce a framework for similarity search based on characterizing a dataset's entropy and fractal dimension. We prove that searching scales in time with metric entropy (number of covering hyperspheres), if the fractal dimension of the dataset is low, and scales in space with the sum of metric entropy and information-theoretic entropy (randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains: high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search (esFragBag, 10x speedup of FragBag). Our framework can be used to achieve "compressive omics," and the general theory can be readily applied to data science problems outside of biology.
Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo

arXiv.org e-Print Archive

Elsevier - Publisher Connector

DSpace@MIT

PubMed Central

Clustered Hierarchical Entropy-Scaling Search of Astronomical and Biological Data

Author: Daniels Noah M.
Ishaq Najib
Student George
Publication venue: DigitalCommons@URI
Publication date: 01/01/2019

Both astronomy and biology are experiencing explosive growth of data, resulting in a “big data” problem that stands in the way of a “big data” opportunity for discovery. One common question asked of such data is that of approximate search (ρ-nearest neighbors search). We present a hierarchical search algorithm for such data sets that takes advantage of particular geometric properties apparent in both astronomical and biological data sets, namely the metric entropy and fractal dimensionality of the data. We present CHESS (Clustered Hierarchical Entropy-Scaling Search), a search tool with virtually no loss in specificity or sensitivity, demonstrating a 13.6× speedup over linear search on the Sloan Digital Sky Survey’s APOGEE data set and a 68× speedup on the GreenGenes 16S metagenomic data set, as well as asymptotically fewer distance comparisons on APOGEE when compared to the FALCONN locality-sensitive hashing library. CHESS demonstrates an asymptotic complexity not directly dependent on data set size, and is in practice at least an order of magnitude faster than linear search by performing fewer distance comparisons. Unlike locality-sensitive hashing approaches, CHESS can work with any user-defined distance function. CHESS also allows for implicit data compression, which we demonstrate on the APOGEE data set. We also discuss an extension allowing for efficient k-nearest neighbors search

arXiv.org e-Print Archive

Crossref

DigitalCommons@URI

CLUSTERED HIERARCHICAL ANOMALY AND OUTLIER DETECTION ALGORITHMS

Author: Daniels Noah M.
Howard Thomas J., III
Ishaq Najib
Publication venue: DigitalCommons@URI
Publication date: 24/11/2021

Anomaly and outlier detection is a long-standing problem in machine learning. In some cases, anomaly detection is easy, such as when data are drawn from well-characterized distributions such as the Gaussian. However, when data occupy high-dimensional spaces, anomaly detection becomes more difficult. We present CLAM (Clustered Learning of Approximate Manifolds), a manifold mapping technique in any metric space. CLAM begins with a fast hierarchical clustering technique and then induces a graph from the cluster tree, based on overlapping clusters as selected using several geometric and topological features. Using these graphs, we implement CHAODA (Clustered Hierarchical Anomaly and Outlier Detection Algorithms), exploring various properties of the graphs and their constituent clusters to find outliers. CHAODA employs a form of transfer learning based on a training set of datasets, and applies this knowledge to a separate test set of datasets of different cardinalities, dimensionalities, and domains. On 24 publicly available datasets, we compare CHAODA (by measure of ROC AUC) to a variety of state-of-the-art unsupervised anomaly-detection algorithms. Six of the datasets are used for training. CHAODA outperforms other approaches on 16 of the remaining 18 datasets. CLAM and CHAODA scale to large, high-dimensional “big data” anomaly-detection problems, and generalize across datasets and distance functions. Source code for CLAM and CHAODA is freely available on GitHub

DigitalCommons@URI

Going the distance for protein function prediction: a new distance metric for protein interaction networks

Author: Cao Mengfei
Cowen Lenore J.
Crovella Mark E.
Daniels Noah M.
Hescott Benjamin
Park Jisoo
Zhang Hao
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2013

Due to an error introduced in the production process, the x-axes in the first panels of Figure 1 and Figure 7 are not formatted correctly. The correct Figure 1 can be viewed here: http://dx.doi.org/10.1371/annotation/343bf260-f6ff-48a2-93b2-3cc79af518a9
In protein-protein interaction (PPI) networks, functional similarity is often inferred based on the function of directly interacting proteins, or more generally, some notion of interaction network proximity among proteins in a local neighborhood. Prior methods typically measure proximity as the shortest-path distance in the network, but this has only a limited ability to capture fine-grained neighborhood distinctions, because most proteins are close to each other, and there are many ties in proximity. We introduce diffusion state distance (DSD), a new metric based on a graph diffusion property, designed to capture finer-grained distinctions in proximity for transfer of functional annotation in PPI networks. We present a tool that, given a PPI network as input, outputs the DSD between every pair of proteins. We show that replacing the shortest-path metric by DSD improves the performance of classical function prediction methods across the board.
MC, HZ, NMD and LJC were supported in part by National Institutes of Health (NIH) R01 grant GM080330. JP was supported in part by NIH grant R01 HD058880. This material is based upon work supported by the National Science Foundation under grant numbers CNS-0905565, CNS-1018266, CNS-1012910, and CNS-1117039, and supported by the Army Research Office under grant W911NF-11-1-0227 (to MEC). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript

CiteSeerX

Boston University Institutional Repository (OpenBU)

Directory of Open Access Journals

PubMed Central

CLAM-Accelerated K-Nearest Neighbors Entropy-Scaling Search of Large High-Dimensional Datasets via an Actualization of the Manifold Hypothesis

Author: Daniels Noah M.
Howard III Thomas J.
Ishaq Najib
McLaughlin Oliver
Prior Morgan E.
Publication date: 11/09/2023

Many fields are experiencing a Big Data explosion, with data collection rates outpacing the rate of computing performance improvements predicted by Moore's Law. Researchers are often interested in similarity search on such data. We present CAKES (CLAM-Accelerated K-NN Entropy Scaling Search), a novel algorithm for k-nearest-neighbor (k-NN) search which leverages geometric and topological properties inherent in large datasets. CAKES assumes the manifold hypothesis and performs best when data occupy a low-dimensional manifold, even if the data occupy a very high-dimensional embedding space. We demonstrate performance improvements ranging from hundreds to tens of thousands of times faster when compared to state-of-the-art approaches such as FAISS and HNSW, when benchmarked on 5 standard datasets. Unlike locality-sensitive hashing approaches, CAKES can work with any user-defined distance function. When data occupy a metric space, CAKES exhibits perfect recall.
Comment: As submitted to IEEE Big Data 202

arXiv.org e-Print Archive

MEDFORD: A HUMAN AND MACHINE READABLE METADATA MARKUP LANGUAGE

Author: Ashey Jill
Couch Alva
Cowen Lenore J.
Daniels Noah M.
Fonticella Jay-Miguel
Freeman John
Greenberg Jane
McKelvie Hailey
Putnam Hollie
Shpilker Polina
Publication venue: DigitalCommons@URI
Publication date: 17/06/2022

Reproducibility of research is essential for science. However, in the way modern computational biology research is done, it is easy to lose track of small but extremely critical details. Key details, such as the specific version of a software package used or the iteration of a genome, can easily be lost in the shuffle, or perhaps not noted at all. Much work is being done on the database and storage side of things, ensuring that there exists a space to store experiment-specific details, but current mechanisms for recording details are cumbersome for scientists to use. We propose a new metadata description language, named MEDFORD, in which scientists can record all details relevant to their research. Human-readable, easily editable, and templatable, MEDFORD serves as a collection point for all notes that a researcher could find relevant to their research, be it for internal use or for future replication. MEDFORD has been applied to coral research, documenting research from RNA-seq analyses to photo collections

PubMed Central

DigitalCommons@URI

Germline DDX41 mutations cause ineffective hematopoiesis and myelodysplasia

Author: Chlon Timothy M
Choi Kwangmin
Daniels Noah J
Gurnari Carmelo
Haferlach Torsten
Hershberger Courtney E
Hueneman Kathleen M
Kuenzi Davis Ashley
Maciejewski Jaroslaw P
Padgett Richard A
Starczynowski Daniel T
Stepanchick Emily
Zheng Yi
Publication venue: 'Elsevier BV'
Publication date: 04/11/2021

DDX41 mutations are the most common germline alterations in adult myelodysplastic syndromes (MDSs). The majority of affected individuals harbor germline monoallelic frameshift DDX41 mutations and subsequently acquire somatic mutations in their other DDX41 allele, typically missense R525H. Hematopoietic progenitor cells (HPCs) with biallelic frameshift and R525H mutations undergo cell cycle arrest and apoptosis, causing bone marrow failure in mice. Mechanistically, DDX41 is essential for small nucleolar RNA (snoRNA) processing, ribosome assembly, and protein synthesis. Although monoallelic DDX41 mutations do not affect hematopoiesis in young mice, a subset of aged mice develops features of MDS. Biallelic mutations in DDX41 are observed at a low frequency in non-dominant hematopoietic stem cell clones in bone marrow (BM) from individuals with MDS. Mice chimeric for monoallelic DDX41 mutant BM cells and a minor population of biallelic mutant BM cells develop hematopoietic defects at a younger age, suggesting that biallelic DDX41 mutant cells are disease modifying in the context of monoallelic DDX41 mutant BM

PubMed Central

ART

Transfer of knowledge from model organisms to evolutionarily distant non-model organisms: The coral Pocillopora damicornis membrane signaling receptome

Author: Berger Bonnie
Brenner Nathanael
Cowen Lenore
Daniels Noah M.
Klein-Seetharaman Judith
Klein-Seetharaman Roshan
Kumar Lokender
Lewinski Nastassja A.
Lynn-Goin Matthew
Olaosebikan Monsurat
Putnam Hollie M.
Roger Liza M.
Singh Rohit
Sledzieski Samuel
Yang Jinkyu
Publication venue: DigitalCommons@URI
Publication date: 01/01/2023

With the ease of gene sequencing and the technology available to study and manipulate non-model organisms, the extension of the methodological toolbox required to translate our understanding of model organisms to non-model organisms has become an urgent problem. For example, mining of large coral and their symbiont sequence data is a challenge, but also provides an opportunity for understanding functionality and evolution of these and other non-model organisms. Much more information than for any other eukaryotic species is available for humans, especially related to signal transduction and diseases. However, the coral cnidarian host and human diverged over 700 million years ago, and homologies between proteins in the two species are therefore often in the gray zone, or at least often undetectable with traditional BLAST searches. We introduce a two-stage approach to identifying putative coral homologues of human proteins. First, through remote homology detection using Hidden Markov Models, we identify candidate human homologues in the cnidarian genome. However, for many proteins, the human genome alone contains multiple family members with similar or even more divergence in sequence. In the second stage, therefore, we filter the remote homology results based on the functional and structural plausibility of each coral candidate, shortlisting the coral proteins likely to have conserved some of the functions of the human proteins. We demonstrate our approach with a pipeline for mapping membrane receptors in humans to membrane receptors in corals, with specific focus on the stony coral, P. damicornis. More than 1000 human membrane receptors mapped to 335 coral receptors, including 151 G protein coupled receptors (GPCRs).
To validate specific sub-families, we chose opsin proteins, representative GPCRs that confer light sensitivity, and Toll-like receptors, representative non-GPCRs, which function in the immune response, and their ability to communicate with microorganisms. Through detailed structure-function analysis of their ligand-binding pockets and downstream signaling cascades, we selected those candidate remote homologues likely to carry out related functions in the corals. This pipeline may prove generally useful for other non-model organisms, such as to support the growing field of synthetic biology

Directory of Open Access Journals

DigitalCommons@URI

Fault Tolerance in Protein Interaction Networks: Stable Bipartite Subgraphs and Redundant Pathways

Author: A Wagner
AR Brady
Arthur Brady
B Papp
C Stark
E Nabieva
FP Roth
G Giaever
GF Berriz
H Karloff
I Ulitsky
JM Cherry
Joel S. Bader
JR Mullen
KH Berger
KS Dimmer
Kyle Maxwell
L Lovasz
LD Hurst
Lenore J. Cowen
LM Blank
M Ashburner
M Wagner
MR Garey
NA Ellis
NM Hollingsworth
Noah Daniels
PM Watt
R Harrison
R Karp
R Kelley
S Bandyopadhyay
SD Oh
T Beissbarth
X Ma
X Zeng
Z Gu
Publication venue: Public Library of Science
Publication date: 01/01/2009

As increasing amounts of high-throughput data for the yeast interactome become available, more system-wide properties are uncovered. One interesting question concerns the fault tolerance of protein interaction networks: whether there exist alternative pathways that can perform some required function if a gene essential to the main mechanism is defective, absent or suppressed. A signature pattern for redundant pathways is the BPM (between-pathway model) motif, introduced by Kelley and Ideker. Past methods proposed to search the yeast interactome for BPM motifs have had several important limitations. First, they have been driven heuristically by local greedy searches, which can lead to the inclusion of extra genes that may not belong in the motif; second, they have been validated solely by functional coherence of the putative pathways using GO enrichment, making it difficult to evaluate putative BPMs in the absence of already known biological annotation. We introduce stable bipartite subgraphs, and show they form a clean and efficient way of generating meaningful BPMs which naturally discard extra genes included by local greedy methods. We show by GO enrichment measures that our BPM set outperforms previous work, covering more known complexes and functional pathways. Perhaps most importantly, since our BPMs are initially generated by examining the genetic-interaction network only, the location of edges in the protein-protein physical interaction network can then be used to statistically validate each candidate BPM, even with sparse GO annotation (or none at all). We uncover some interesting biological examples of previously unknown putative redundant pathways in such areas as vesicle-mediated transport and DNA repair

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central